Audio-visual speech enhancement with AVCDCN (audio-visual codebook dependent cepstral normalization)
Abstract
In this paper, we introduce a non-linear enhancement technique called Audio-Visual Codebook Dependent Cepstral Normalization (AVCDCN) and consider its use with both audio-only and audio-visual speech recognition. AVCDCN is inspired by CDCN [1] [2], an audio-only enhancement technique that approximates the non-linear effect of noise on speech with a piecewise-constant function. Our experiments show that the use of visual information in AVCDCN allows significant performance gains over CDCN.

1. AUDIO-VISUAL APPROACH TO SPEECH RECOGNITION

Although current automatic speech recognition (ASR) systems perform remarkably well on a variety of recognition tasks in clean audio conditions, their accuracy degrades as the level of environmental noise increases. New approaches are needed to address the lack of noise robustness in ASR. In this paper, we propose a multi-sensor approach to ASR in which visual information, obtained from the speaker's face through a second channel, complements the standard audio information. Audio-visual ASR, where both an audio channel and a visual channel are input to the recognition system, has already been shown to outperform traditional audio-only ASR in noisy conditions [5] [6]. In addition to audio-visual ASR, the visual modality has been investigated as a means of enhancement, where clean audio features are estimated from audio-visual speech when the audio channel is corrupted by noise [3] [4]. However, in [4] for example, the ASR performance of linear audio-visual enhancement (where clean audio features are estimated via linear filtering of the noisy audio-visual features) remains significantly inferior to the performance of audio-visual ASR. In this paper, we introduce a non-linear enhancement technique called Audio-Visual Codebook Dependent Cepstral Normalization (AVCDCN) and consider its use with both audio-only ASR and audio-visual ASR.
AVCDCN is inspired by CDCN [1] [2], an audio-only non-linear enhancement technique that is well known in the field of ASR. In CDCN, the non-linear effect of the noise on the clean speech features is approximated with a piecewise-constant function. AVCDCN is a multi-sensor extension of CDCN that integrates the use of audio and visual features. Our experiments show that the use of visual information in AVCDCN allows significant performance gains over CDCN.

2. PRINCIPLE OF AVCDCN

Consider a cepstral vector of audio features corrupted by noise and observed at a given time, the unknown vector of noise features, and the unknown vector of clean speech features that would have been observed in the absence of noise. The principle of CDCN [1] [2] is to estimate the clean speech features as their expected value given the observed noisy features.
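The expected-value estimate described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' formulation: the piecewise-constant approximation is taken to be one correction vector per codeword, the codeword posteriors come from a diagonal-covariance Gaussian mixture, the clean estimate is the noisy vector minus the posterior-weighted corrections, and the visual features enter only by computing the posteriors on a concatenated audio-visual vector. The names `codeword_posteriors` and `enhance` and all numeric values are hypothetical.

```python
import numpy as np


def codeword_posteriors(obs, means, variances, priors):
    """Posterior p(k | obs) under a diagonal-covariance Gaussian mixture.

    obs: (D,) observation; means, variances: (K, D); priors: (K,).
    """
    diff = obs[None, :] - means
    log_lik = -0.5 * np.sum(
        diff ** 2 / variances + np.log(2.0 * np.pi * variances), axis=1
    )
    log_post = np.log(priors) + log_lik
    log_post -= log_post.max()  # subtract the max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()


def enhance(noisy_audio, posteriors, corrections):
    """Assumed estimate: noisy audio minus posterior-weighted corrections.

    corrections: (K, D_audio), one constant correction vector per codeword.
    """
    return noisy_audio - posteriors @ corrections


# Toy setup (all values hypothetical): 2 codewords, 2 audio dims, 1 visual dim.
audio_means = np.array([[0.0, 0.0], [1.0, 1.0]])
visual_means = np.array([[-2.0], [2.0]])
av_means = np.hstack([audio_means, visual_means])
priors = np.array([0.5, 0.5])
corrections = np.array([[0.2, 0.2], [0.8, 0.8]])

noisy_audio = np.array([0.5, 0.5])  # equidistant from both audio codewords
visual = np.array([2.0])            # visual evidence favours codeword 1

# Audio-only posteriors (CDCN-like) versus audio-visual posteriors.
post_audio = codeword_posteriors(
    noisy_audio, audio_means, np.ones_like(audio_means), priors
)
post_av = codeword_posteriors(
    np.concatenate([noisy_audio, visual]), av_means, np.ones_like(av_means), priors
)

clean_audio_only = enhance(noisy_audio, post_audio, corrections)
clean_audio_visual = enhance(noisy_audio, post_av, corrections)
```

In this toy case the audio vector alone cannot discriminate between the two codewords, while the appended visual dimension shifts the posterior mass toward one of them and therefore changes which correction dominates. This is only meant to show where visual information could influence the estimate under the stated assumptions.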